We will re-use the colon cancer data in GSE33126
library(GEOquery)
url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/"
filenm <- "GSE33126_series_matrix.txt.gz"
if(!file.exists("GSE33126_series_matrix.txt.gz")) download.file(paste(url, filenm, sep=""), destfile=filenm)
colonData <- getGEO(filename=filenm)
## File stored at:
## /tmp/RtmpsdXCPn/GPL6947.soft
colonData
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 48803 features, 18 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM820516 GSM820517 ... GSM820533 (18 total)
## varLabels: title geo_accession ... data_row_count (31 total)
## varMetadata: labelDescription
## featureData
## featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48803
## total)
## fvarLabels: ID nuID ... GB_ACC (30 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: GPL6947
exprs (colonData) <- log2 (exprs(colonData))
SampleGroup <- pData(colonData)$source_name_ch1
Patient <- pData(colonData)$characteristics_ch1.1
A first step in the analysis of microarray data is often to remove any uniformative probes. We can do this because typically only 50% of probes genes will be expressed, and even fewer than this will be differentially expressed. Including such non-informative genes in visualisa- tion will obscure any biological differences that exist. The genefilter package contains a suite of tools for the filtering of microarray data. The varFilter function allows probes with low- variance to be removed from the dataset. The metric using to decide which probes to remove is the Inter-Quartile Range (IQR), and by default half of the probes are removed. Both the function used to do the filtering, and cut-off can be specified by the user.
library (genefilter)
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:base':
##
## anyNA
dim (colonData)
## Features Samples
## 48803 18
varFiltered <- varFilter (colonData)
dim (varFiltered)
## Features Samples
## 24401 18
nrow (colonData) / nrow (varFiltered)
## Features
## 2.000041
The first step towards clustering the data is to construct a distance matrix. Each entry in this matrix is the pairwise distance between two samples. Note that for expression data we have to transpose the standard ExpressionSet format, as clustering is designed to work on rows rather than columns. The standard function to make a distance matrix in R is dist which uses the euclidean metric. As the data entries in a distance matrix are symmetrical, it has a different representation to the default matrix (i.e. values are not repeated unnecessar- ily), and clustering algorithms are able to accept this representation as input.
euc.dist <- dist (t(exprs(varFiltered)))
euc.dist
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820517 146.35076
## GSM820518 112.10192 145.53738
## GSM820519 127.02787 111.88327 128.89085
## GSM820520 103.88617 145.88415 93.26812 124.41043
## GSM820521 115.74149 109.59666 112.44568 77.60015 110.62329
## GSM820522 86.59678 140.12151 115.68936 131.12234 98.96944 117.26705
## GSM820523 118.07552 115.02450 139.86229 101.32561 134.43368 89.05971
## GSM820524 113.05465 135.64296 106.83091 126.25701 85.61877 117.19595
## GSM820525 142.82777 72.50357 145.43070 101.26515 138.18094 107.93645
## GSM820526 87.68842 160.67923 109.56682 126.28151 102.83077 114.59309
## GSM820527 102.51958 121.92622 119.62366 78.01042 110.89035 69.13310
## GSM820528 85.71746 132.44295 96.04791 115.36869 75.14097 103.00987
## GSM820529 107.07671 118.90261 120.08406 94.25470 114.46866 79.01849
## GSM820530 80.36987 151.46130 94.29680 123.97811 86.59315 103.45768
## GSM820531 108.26678 100.31635 115.03569 82.27479 112.72244 64.87145
## GSM820532 128.93571 95.33138 115.04522 106.11806 116.81059 102.43630
## GSM820533 106.21944 134.12719 115.78666 97.39843 115.37280 68.75089
## GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820517
## GSM820518
## GSM820519
## GSM820520
## GSM820521
## GSM820522
## GSM820523 105.62731
## GSM820524 85.85654 116.10403
## GSM820525 127.62848 99.13618 120.46802
## GSM820526 107.40340 134.03152 114.20701 154.74502
## GSM820527 106.32390 77.38321 115.56748 109.20371 104.05182
## GSM820528 89.80380 112.44894 90.45972 124.46834 97.24261 90.59615
## GSM820529 116.33941 83.80927 123.55735 107.46869 119.74686 59.39630
## GSM820530 97.19378 126.82149 107.83406 146.18764 87.43164 101.35185
## GSM820531 116.26826 91.51850 122.61883 103.07141 119.87820 70.55040
## GSM820532 124.92807 122.94258 116.57744 103.39879 137.24907 113.49486
## GSM820533 116.89886 93.17867 127.75562 130.21971 109.87582 64.16850
## GSM820528 GSM820529 GSM820530 GSM820531 GSM820532
## GSM820517
## GSM820518
## GSM820519
## GSM820520
## GSM820521
## GSM820522
## GSM820523
## GSM820524
## GSM820525
## GSM820526
## GSM820527
## GSM820528
## GSM820529 86.75871
## GSM820530 81.22235 103.68352
## GSM820531 96.07646 70.24274 99.39146
## GSM820532 105.14902 109.33036 120.61473 93.09175
## GSM820533 99.91990 68.56935 96.27636 73.49531 119.19071
For gene-expression data, it is common to use correlation as a distance metric rather than the Euclidean. You should make sure that you know the difference between the two metrics. The cor function can be used to calculate the correlation of columns in a matrix. Each row (or column) in the resulting matrix is the correlation of that sample with all other samples in the dataset. The matrix is symmetrical and we can transform this into a distance matrix by first subtracting 1 from the correlation matrix. Hence, samples with a higher correlation have a smaller ’distance’.
corMat <- cor(exprs(varFiltered))
corMat
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521
## GSM820516 1.0000000 0.8582704 0.9171976 0.8937222 0.9286023 0.9112575
## GSM820517 0.8582704 1.0000000 0.8600812 0.9173538 0.8588151 0.9202012
## GSM820518 0.9171976 0.8600812 1.0000000 0.8907537 0.9425576 0.9163944
## GSM820519 0.8937222 0.9173538 0.8907537 1.0000000 0.8978211 0.9602207
## GSM820520 0.9286023 0.8588151 0.9425576 0.8978211 1.0000000 0.9187180
## GSM820521 0.9112575 0.9202012 0.9163944 0.9602207 0.9187180 1.0000000
## GSM820522 0.9503610 0.8696543 0.9115547 0.8864231 0.9349881 0.9085925
## GSM820523 0.9081569 0.9126298 0.8713382 0.9324985 0.8806670 0.9475820
## GSM820524 0.9156037 0.8781899 0.9247661 0.8949583 0.9514800 0.9089643
## GSM820525 0.8648318 0.9650722 0.8601046 0.9322218 0.8731558 0.9224893
## GSM820526 0.9493074 0.8293478 0.9209805 0.8950744 0.9301280 0.9131154
## GSM820527 0.9302064 0.9009531 0.9051381 0.9597308 0.9180982 0.9681213
## GSM820528 0.9511778 0.8829546 0.9388145 0.9117054 0.9623691 0.9291090
## GSM820529 0.9241083 0.9061505 0.9047172 0.9413361 0.9130390 0.9584997
## GSM820530 0.9573796 0.8482370 0.9414236 0.8987844 0.9504108 0.9291209
## GSM820531 0.9223823 0.9331725 0.9125297 0.9552921 0.9156400 0.9720189
## GSM820532 0.8897132 0.9395367 0.9123688 0.9254827 0.9092395 0.9300936
## GSM820533 0.9255052 0.8809038 0.9116292 0.9374967 0.9118986 0.9686817
## GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
## GSM820516 0.9503610 0.9081569 0.9156037 0.8648318 0.9493074 0.9302064
## GSM820517 0.8696543 0.9126298 0.8781899 0.9650722 0.8293478 0.9009531
## GSM820518 0.9115547 0.8713382 0.9247661 0.8601046 0.9209805 0.9051381
## GSM820519 0.8864231 0.9324985 0.8949583 0.9322218 0.8950744 0.9597308
## GSM820520 0.9349881 0.8806670 0.9514800 0.8731558 0.9301280 0.9180982
## GSM820521 0.9085925 0.9475820 0.9089643 0.9224893 0.9131154 0.9681213
## GSM820522 1.0000000 0.9262929 0.9511794 0.8917089 0.9237271 0.9246435
## GSM820523 0.9262929 1.0000000 0.9111565 0.9350314 0.8817770 0.9603706
## GSM820524 0.9511794 0.9111565 1.0000000 0.9037952 0.9139681 0.9112436
## GSM820525 0.8917089 0.9350314 0.9037952 1.0000000 0.8415122 0.9204252
## GSM820526 0.9237271 0.8817770 0.9139681 0.8415122 1.0000000 0.9282016
## GSM820527 0.9246435 0.9603706 0.9112436 0.9204252 0.9282016 1.0000000
## GSM820528 0.9461775 0.9161095 0.9455812 0.8964670 0.9372344 0.9449786
## GSM820529 0.9101066 0.9536168 0.8988913 0.9232237 0.9051941 0.9764954
## GSM820530 0.9374851 0.8940675 0.9232363 0.8584350 0.9496136 0.9318156
## GSM820531 0.9101826 0.9446634 0.9003839 0.9293514 0.9049499 0.9668194
## GSM820532 0.8961047 0.8999407 0.9097986 0.9287621 0.8751776 0.9139221
## GSM820533 0.9094910 0.9427849 0.8921813 0.8875923 0.9203758 0.9726729
## GSM820528 GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
## GSM820516 0.9511778 0.9241083 0.9573796 0.9223823 0.8897132 0.9255052
## GSM820517 0.8829546 0.9061505 0.8482370 0.9331725 0.9395367 0.8809038
## GSM820518 0.9388145 0.9047172 0.9414236 0.9125297 0.9123688 0.9116292
## GSM820519 0.9117054 0.9413361 0.8987844 0.9552921 0.9254827 0.9374967
## GSM820520 0.9623691 0.9130390 0.9504108 0.9156400 0.9092395 0.9118986
## GSM820521 0.9291090 0.9584997 0.9291209 0.9720189 0.9300936 0.9686817
## GSM820522 0.9461775 0.9101066 0.9374851 0.9101826 0.8961047 0.9094910
## GSM820523 0.9161095 0.9536168 0.8940675 0.9446634 0.8999407 0.9427849
## GSM820524 0.9455812 0.8988913 0.9232363 0.9003839 0.9097986 0.8921813
## GSM820525 0.8964670 0.9232237 0.8584350 0.9293514 0.9287621 0.8875923
## GSM820526 0.9372344 0.9051941 0.9496136 0.9049499 0.8751776 0.9203758
## GSM820527 0.9449786 0.9764954 0.9318156 0.9668194 0.9139221 0.9726729
## GSM820528 1.0000000 0.9497681 0.9561975 0.9383683 0.9260060 0.9335861
## GSM820529 0.9497681 1.0000000 0.9288649 0.9672203 0.9204350 0.9688689
## GSM820530 0.9561975 0.9288649 1.0000000 0.9346090 0.9035232 0.9388144
## GSM820531 0.9383683 0.9672203 0.9346090 1.0000000 0.9422943 0.9642204
## GSM820532 0.9260060 0.9204350 0.9035232 0.9422943 1.0000000 0.9057118
## GSM820533 0.9335861 0.9688689 0.9388144 0.9642204 0.9057118 1.0000000
cor.dist <- as.dist(1 - corMat)
Hierachical clustering is the method by which samples with similar profiles are grouped to- gether, based on their distances. The method for grouping samples can be specified, with the default being to use complete linkage. Different linkage methods can affect the properties of the clustering output. e.g. the ’ward’ method tends to produce smaller, compact clusters. A popular way of displaying the results of clustering is a dendrogram, which arranges the sam- ples on the x-axis and shows the distances between samples on the y-axis. It is important to note that the distance of samples on the x-axis has no meaning. The fact that two samples are displayed next to each other, does not m
par(mfrow = c (1 , 2))
clust.euclid = hclust(euc.dist)
clust.cor = hclust (cor.dist)
par (mfrow = c (1 , 2))
plot(clust.euclid , label = SampleGroup )
plot(clust.cor , label = SampleGroup )
If we want to interpret the data presented in a clustering analysis, we need a way of extract- ing which samples are grouped together, or to determine the optimal grouping of samples. One intuitive way of assigning groups it to ’cut’ the dendrogram at a particular height on the y-axis. We can do this manually on the plot, or use the cutree function to return the labels of samples that are belong to the same group when the dendrogram is cut at the spec- ified height, h. Alternatively, we can specify how many groups, k, that we want to create.
library (cluster)
par (mfrow = c(1 , 1))
plot(clust.cor)
abline (h = 0.12, col = " red ")
cutree (clust.cor , h = 0.12)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522
## 1 2 1 2 1 2 1
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529
## 2 1 2 1 2 1 2
## GSM820530 GSM820531 GSM820532 GSM820533
## 1 2 2 2
cutree (clust.cor , k = 2)
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522
## 1 2 1 2 1 2 1
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529
## 2 1 2 1 2 1 2
## GSM820530 GSM820531 GSM820532 GSM820533
## 1 2 2 2
table (cutree(clust.cor , k = 3) , SampleGroup)
## SampleGroup
## normal tumor
## 1 0 8
## 2 2 1
## 3 7 0
A Silhouette plot can be used to choose the optimal number of clusters. For each sample, we calculate a value that quantifies how well it ’fits’ the cluster that it has been assigned to. If the value is around 1, then the sample closely fits other samples in the same cluster. How- ever, if the value is around 0 the sample could belong to another cluster. In the silhouette plot, the values for each cluster are plotted together and ordered from largest to smallest. The number of samples belonging to each group is also displayed.
par (mfrow = c (2 , 2))
plot (silhouette(cutree(clust.cor, k=2),cor.dist),
col="red",main=paste("k=",2))
plot (silhouette(cutree(clust.cor, k=3),cor.dist),
col="red",main=paste("k=",3))
plot (silhouette(cutree(clust.cor, k=4),cor.dist),
col="red",main=paste("k=",4))
plot (silhouette(cutree(clust.cor, k=5),cor.dist),
col="red",main=paste("k=",5))
If we have prior knowledge of how many clusters to expect, we could run the clustering in a supervised manner. The Partition Around Medioids method can be used to group samples into k clusters.
pam.clus <- pam (euc.dist , k = 2)
clusplot (pam.clus)
pam.clus$clustering
## GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522
## 1 2 1 2 1 2 1
## GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529
## 2 1 2 1 2 1 2
## GSM820530 GSM820531 GSM820532 GSM820533
## 1 2 2 2
table(pam.clus$clustering , SampleGroup)
## SampleGroup
## normal tumor
## 1 0 8
## 2 9 1
A heatmap is often used to visualise differences between samples. Each row represents a gene and each column is an array and coloured cells indicate the expression levels of genes. Both samples and genes with similar expression profile are clustered together. By default, euclidean distances are used with complete linkage clustering. Drawing a heatmap in R uses a lot of memory and can take a long time, therefore reduc- ing the amount of data to be plotted is usually recommended. Including too many non- informative genes can also make it difficult to spot patterns. Typically, data are filtered to include the genes which tell us the most about the biological variation. In an un-supervised setting, the selection of such genes is done without using prior knowledge about the sample groupings.
IQRs = apply (exprs(varFiltered) , 1 , IQR )
highVarGenes = order (IQRs, decreasing = T )[1:100]
Symbols <- as.character(fData(colonData)$Symbol[highVarGenes])
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]),
labCol = SampleGroup , labRow = Symbols)
The default options for the heatmap are to cluster both the genes (rows) and samples (columns). However, sometimes we might want to specify a particular order. For example, we might want to order the columns according to sample groups. We can do this by re-ordering the input matrix manually and setting the Colv to NA. This tells the heatmap function not be cluster the columns. It choosing this option, we need to be careful that any column colour- ings or labels are set to the same ordering. Alternatively, a pre-calculated dendrogram could be used.
clus.ward <- hclust (cor.dist , method = "ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
Colv = as.dendrogram(clus.ward) , labCol = SampleGroup )
The heatmap function can be customised in many ways to make the output more informa- tive. For example, the labRow and labCol parameters can be used to give labels to the rows (genes) and columns (sample) of the heatmap. Similarly, ColSideColors and RowSideCol- ors give coloured labels, often used to indicate different groups which are know in advance. See the help page for heatmap for more details.
labs <- as.factor(SampleGroup)
levels(labs) <- c ("yellow" , "blue")
heatmap(as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
labCol = Patient, ColSideColors = as.character(labs),
labRow = Symbols)
The colours used to display the gene expression values can also be modified. For this, we can use the RColorBrewer package which has functions for creating pre-defined palettes. The function display.brewer.all can be used to display the palettes available through this package.
library (RColorBrewer)
display.brewer.all()
hmcol <- brewer.pal(11 , "RdBu")
heatmap (as.matrix(exprs(varFiltered)[highVarGenes, ]) ,
ColSideColors = as.character(labs) , labRow = Symbols,
col=hmcol)
You should avoid using the traditional red / green colour scheme as it may be difficult for people with colour-blindness to interpret
A popular use for heatmaps is to take an existing gene list (e.g. genes found to be significant in a previous study, or genes belonging to a particular pathway) and produce an image of how they cluster the data for exploratory purposes. This can be achieved easily by selecting the correct rows in the data matrix.
library (illuminaHumanv3.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
pathwayGenes <- unlist (mget("04110" , revmap(illuminaHumanv3PATH)))
pathwayGenes <- pathwayGenes [pathwayGenes %in% featureNames(varFiltered)]
symbols <- fData(varFiltered)[pathwayGenes , "Symbol"]
heatmap (as.matrix(exprs(varFiltered)[pathwayGenes , ]),
ColSideColors = as.character (labs) , labCol = Patient ,
labRow = symbols , col = hmcol )
Principal components analysis (PCA) is a data reduction technique that allows us to sim- plify multidimensional data sets to 2 or 3 dimensions for plotting purposes and identify which factors explain the most variability in the data. We can use the prcomp function to perform a PCA and we have to supply it with a distance matrix. The resulting object contains infor- mation on the proportion of variance that is ’explained’ by each component. Ideally, we want to see that the majority of the variance is explained by the first 2 or 3 components, and that these components are associated with a biological factor
pca <- prcomp(exprs(varFiltered))
plot(pca)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 7.171 1.13645 0.84320 0.6838 0.53334 0.50432
## Proportion of Variance 0.924 0.02321 0.01278 0.0084 0.00511 0.00457
## Cumulative Proportion 0.924 0.94724 0.96002 0.9684 0.97353 0.97810
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.44891 0.40378 0.38232 0.36830 0.32469 0.30894
## Proportion of Variance 0.00362 0.00293 0.00263 0.00244 0.00189 0.00172
## Cumulative Proportion 0.98172 0.98465 0.98728 0.98972 0.99161 0.99333
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.27589 0.27127 0.25520 0.24435 0.22922 0.2104
## Proportion of Variance 0.00137 0.00132 0.00117 0.00107 0.00094 0.0008
## Cumulative Proportion 0.99469 0.99602 0.99719 0.99826 0.99920 1.0000
The ggplot2 package is convenient for plotting principal components results as it allows many different covariates to be overlaid on the plot. See http://ggplot2.org/ for more informa- tion on this package.
library(ggplot2)
clusLabs <- cutree(clust.cor , k = 3)
pcRes <- data.frame(pca$rotation , SampleGroup , Sample = Patient)
ggplot (pcRes , aes(x = PC1 , y = PC2 , col = SampleGroup ,
label = Patient , pch = as.factor(clusLabs))) + geom_point () +
geom_text(vjust=0,alpha=0.5)
In this section we give an example of how to find a list of relevant pathways / GO terms from a list of differentially-expressed genes. We will use the colon cancer data that we down- loaded from GEO.
We are now going to create a gene universe by removing genes for will not contribute to the subsequent analysis. Such filtering is done without regarding the phenotype variables - hence a ”non-specific” filter. An Illumina Human6 chip contains around 48,00 probes, but less than half of these have enough detailed information to useful for a GO analysis. Therefore we re- strict the dataset to only probes for which we have a Entrez ID. It is also recommended to select probes with sufficient variability across samples to be interesting; as probes with little variability will no be interesting to the question we are trying to answer. The interquartile- range of each probe across all arrays is commonly used for this with a cut-off of 0.5.
library (illuminaHumanv3.db)
entrezIds = mget(rownames(exprs(colonData)) , illuminaHumanv3ENTREZID ,
ifnotfound = NA )
haveEntrezId = names(entrezIds)[sapply(entrezIds , function (x) !is.na(x))]
entrezSubset = exprs (colonData)[haveEntrezId , ]
entrezIQR = apply(entrezSubset, 1 , IQR)
selected = entrezIQR > 0.5
nsFiltered = entrezSubset [selected ,]
universeIds = unlist (mget(rownames(nsFiltered ) , illuminaHumanv3ENTREZID ,
ifnotfound = NA ))
Remember that the size of the universe can have an effect on the analysis. If the universe is made artificially large by including too many uninformative probes, the p-values for the GO terms will appear more significant.
We now test the genes in the universe to see which ones have significant differences between the two groups. For this, we use the rowttests function implemented in the genefilter pack- age, which performs a t-test for each row with respect to a factor. The p-values of the test can be extracted, with one p-value given for each probe.
library (GOstats)
## Loading required package: Category
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following object is masked from 'package:IRanges':
##
## expand
##
## Loading required package: GO.db
##
## Loading required package: graph
##
## Attaching package: 'GOstats'
##
## The following object is masked from 'package:AnnotationDbi':
##
## makeGOGraph
library (genefilter)
fac = as.factor(SampleGroup)
ttests = rowttests(as.matrix(nsFiltered) , fac)
smPV = ttests$p.value < 0.05
pvalFiltered = nsFiltered [smPV , ]
dim (pvalFiltered)
## [1] 3799 18
selectedEntrezIds = unlist (mget(rownames(pvalFiltered) ,
illuminaHumanv3ENTREZID , ifnotfound = NA))
The hyperGTest function is used to do the hypergeometric test for GO terms. Rather than passing a long list of parameters to the function. An object of type GOHyperGParams is created to hold all the parameters we need to run the hypergeometric test. This object can then be passed to hyperGTest multiple times without having to re-type the parameters each time. The meanings of these parameters are as follows:
params = new ("GOHyperGParams" , geneIds = selectedEntrezIds ,
universeGeneIds = universeIds , annotation = "illuminaHumanv3" ,
ontology = "BP" , pvalueCutoff = 0.05 , conditional = FALSE ,
testDirection = "over")
## Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
hgOver = hyperGTest(params)
hgOver
## Gene to GO BP test for over-representation
## 10029 GO BP ids tested (564 have p < 0.05)
## Selected gene set size: 2859
## Gene universe size: 6588
## Annotation package: illuminaHumanv3
The summary function can be used to view the results of the test in matrix form. The rows of the matrix are arranged in order of significance. The p-value is shown for each GO term along with with total number of genes for that GO term, number of genes we would be ex- pect to appear in the gene list by chance and that number that were observed. A descriptive name is also given for each term. The results can also be printed out to a HTML report us- ing htmlReport.
summary (hgOver)[1:20 , ]
## GOBPID Pvalue OddsRatio ExpCount Count Size
## 1 GO:0022613 1.237776e-11 2.929790 74.64299 118 172
## 2 GO:0042254 1.092750e-10 3.640835 48.60474 82 112
## 3 GO:1903047 1.382925e-08 1.736592 193.11703 250 445
## 4 GO:0034660 2.184857e-08 2.372528 75.94490 112 175
## 5 GO:0007059 2.959786e-08 2.791676 53.81239 84 124
## 6 GO:0051301 4.064662e-08 1.870589 138.00273 185 318
## 7 GO:0007067 4.746980e-08 2.155316 90.69991 129 209
## 8 GO:0016072 5.650406e-08 3.830055 32.11384 55 74
## 9 GO:0006364 7.719281e-08 3.894096 30.81193 53 71
## 10 GO:0010564 1.304123e-07 1.904748 119.77596 162 276
## 11 GO:0000280 1.571832e-07 1.961002 108.05874 148 249
## 12 GO:0070507 4.034994e-07 3.669752 29.51002 50 68
## 13 GO:0048285 4.565099e-07 1.877784 113.70036 153 262
## 14 GO:0022402 6.206269e-07 1.507148 273.83561 332 631
## 15 GO:0098813 1.118107e-06 2.939145 37.75546 60 87
## 16 GO:0000278 2.788264e-06 1.517878 228.70264 279 527
## 17 GO:0032886 3.647483e-06 2.906134 34.71767 55 80
## 18 GO:0006396 7.660386e-06 1.659162 137.13479 175 316
## 19 GO:0000070 7.916781e-06 3.099627 29.07605 47 67
## 20 GO:0003012 8.056638e-06 1.964313 76.37887 105 176
## Term
## 1 ribonucleoprotein complex biogenesis
## 2 ribosome biogenesis
## 3 mitotic cell cycle process
## 4 ncRNA metabolic process
## 5 chromosome segregation
## 6 cell division
## 7 mitotic nuclear division
## 8 rRNA metabolic process
## 9 rRNA processing
## 10 regulation of cell cycle process
## 11 nuclear division
## 12 regulation of microtubule cytoskeleton organization
## 13 organelle fission
## 14 cell cycle process
## 15 nuclear chromosome segregation
## 16 mitotic cell cycle
## 17 regulation of microtubule-based process
## 18 RNA processing
## 19 mitotic sister chromatid segregation
## 20 muscle system process
GOstats also has the facility to test for KEGG pathways and chromosome bands which are over-reprsented. The procedure of creating a gene universe and set of selected genes is the same. However, we have to use a different object for the parameters, as not all
keggParams = new ("KEGGHyperGParams" , geneIds = selectedEntrezIds ,
universeGeneIds = universeIds , annotation = "illuminaHumanv3" ,
pvalueCutoff = 0.05 , testDirection = "over" )
## Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
keggHgOver = hyperGTest (keggParams)
summary (keggHgOver)
##
## KEGG.db contains mappings based on older data because the original
## resource was removed from the the public domain before the most
## recent update was produced. This package should now be
## considered deprecated and future versions of Bioconductor may
## not have it available. Users who want more current data are
## encouraged to look at the KEGGREST or reactome.db packages
## KEGGID Pvalue OddsRatio ExpCount Count Size
## 1 03008 6.925084e-06 4.965818 17.303241 31 39
## 2 03013 4.268631e-04 2.446669 28.395062 42 64
## 3 03030 1.273628e-03 4.020631 11.091821 19 25
## 4 03020 3.349777e-03 Inf 3.105710 7 7
## 5 00230 4.485758e-03 1.843030 36.824846 49 83
## 6 00140 6.564118e-03 3.376837 9.760802 16 22
## 7 03040 9.196217e-03 1.999105 23.958333 33 54
## 8 00565 9.201126e-03 3.541901 8.429784 14 19
## 9 04110 1.144609e-02 1.679426 39.043210 50 88
## 10 04975 1.826333e-02 2.949531 8.873457 14 20
## 11 00240 1.837727e-02 1.874590 23.070988 31 52
## 12 04020 2.009462e-02 1.682742 31.944444 41 72
## 13 04610 4.616556e-02 1.818878 17.303241 23 39
## Term
## 1 Ribosome biogenesis in eukaryotes
## 2 RNA transport
## 3 DNA replication
## 4 RNA polymerase
## 5 Purine metabolism
## 6 Steroid hormone biosynthesis
## 7 Spliceosome
## 8 Ether lipid metabolism
## 9 Cell cycle
## 10 Fat digestion and absorption
## 11 Pyrimidine metabolism
## 12 Calcium signaling pathway
## 13 Complement and coagulation cascades
chrParams = new ("ChrMapHyperGParams" , geneIds = selectedEntrezIds ,
universeGeneIds = universeIds , annotation = "illuminaHumanv3" ,
pvalueCutoff = 0.05 , testDirection = "over" , conditional = TRUE )
## Warning in makeValidParams(.Object): removing duplicate IDs in geneIds
## Warning in makeValidParams(.Object): removing duplicate IDs in
## universeGeneIds
chrHgOver = hyperGTest (chrParams)
## Warning: unexpected band notation: Xq13.3-Xq21.2 using Xq
## Warning in makeChrMapToEntrez(chip, univ): dropping invalid items: tdb7990
summary (chrHgOver)
## ChrMapID Pvalue OddsRatio ExpCount Count Size
## 1 2p12 0.006558192 Inf 2.597003 6 6
## 2 5q1 0.006941990 1.863516 30.298368 41 70
## 3 5q35.1 0.010925116 5.910260 4.761172 9 11
## 4 Xp11 0.011416785 2.138478 18.179021 26 42
## 5 6q22.31 0.014104042 9.190211 3.462671 7 8
## 6 3q22.1 0.021018600 5.251937 4.328338 8 10
## 7 6p25 0.021365911 3.152839 7.358175 12 17
## 8 16q13 0.021365911 3.152839 7.358175 12 17
## 9 9p13 0.024369198 1.972750 17.313353 24 40
## 10 2q21 0.026927993 3.939244 5.194006 9 12
## 11 7q 0.028073272 1.290190 107.775622 123 249
## 12 6q22 0.028139668 2.463703 9.955178 15 23
## 13 7q33 0.028893629 7.874884 3.029837 6 7
## 14 20 0.028991119 1.320562 89.163768 103 206
## 15 3p21.1 0.031836512 3.282946 6.059674 10 14
## 16 3p2 0.031950526 1.371733 66.223575 78 153
## 17 8q2 0.033685661 1.351734 70.984747 83 164
## 18 5q12.1 0.035061273 Inf 1.731335 4 4
## 19 8q24.21 0.035061273 Inf 1.731335 4 4
## 20 Xp22.11 0.035061273 Inf 1.731335 4 4
## 21 16q1 0.035452834 2.896575 6.915382 11 16
## 22 10p13 0.035855530 2.889206 6.925341 11 16
## 23 12q2 0.037539376 1.407122 52.805727 63 122
## 24 10q11 0.039115387 2.626745 7.791009 12 18
## 25 8q24.1 0.039115387 2.626745 7.791009 12 18
## 26 10p15 0.039689671 4.594021 3.895504 7 9
## 27 10q 0.040364684 2.456356 8.622919 13 20
## 28 21q22.1 0.041737298 2.439301 8.656676 13 20
## 29 4q13 0.043825908 2.298743 9.522344 14 22
## 30 10q2 0.046678748 1.281341 88.730934 101 205
## 31 20q1 0.047805062 1.357580 57.999732 68 134
Ocassionally, we might want to check if a particular set of genes appear to be up- or down- regulated in an analysis. This can be checked using the geneSetTest function in limma, which computes a p-value to test the hypothesis that the selected genes have more extreme test-statistics than one might expect by chance. Moreover, separate tests can be performed to see if the selected genes are up-regulated (alternative=up) or down-regulated (alterna- tive=down). The default is to test for extreme statistics regardless of sign.
library (limma)
ccGenes <- unlist (mget("04110" , revmap (illuminaHumanv3PATH)))
ccInd <- which (rownames(nsFiltered) %in% ccGenes)
geneSetTest (index = ccInd , statistics = ttests$statistic)
## [1] 0.0001225606
geneSetTest (index = ccInd , statistics = ttests$statistic ,
alternative = "down" )
## [1] 6.774774e-18
geneSetTest (index = ccInd , statistics = ttests$statistic ,
alternative = "up" )
## [1] 1
barcodeplot (ttests$statistic , index = ccInd )
plot (density(ttests$statistic))
lines (density (ttests$statistic[ccInd]) , col = " red " )
The Bioconductor project has a collection of example datasets. Often these are used as ex- amples to illustrate a particular package or functionality, or to accompany the analysis pre- sented in a publication. For example, several datasets are presented to accompany the genefu package which has functions useful for the classification of breast cancer patients based on expression profiles. An experimental dataset can be installed and loaded as with any other Bioconductor pack- age. The data itself is saved as an object in the package. You will need to see the documen- tation for the package to find out the relevant object name. The full list of datasets available through Bioconductor can be found here
library(breastCancerVDX)
library(breastCancerTRANSBIG)
data(vdx)
data(transbig)
dim(vdx)
## Features Samples
## 22283 344
dim(transbig)
## Features Samples
## 22283 198
annotation(vdx)
## [1] "hgu133a"
annotation(transbig)
## [1] "hgu133a"
If we want any classifers to be reproducible and applicable to other datasets, it is sensible to exclude probes that do not have sufficient annotation from the analysis. For this, we can use the genefilter package as before. The nsFilter function performs this annotation-based filtering as well as variance filtering. The output of the function includes details about how many probes were removed at each stage of the filtering.
library (genefilter)
vdx.filt <- nsFilter(vdx)
##
vdx.filt
## $eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 6221 features, 344 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
## varLabels: samplename dataset ... e.os (21 total)
## varMetadata: labelDescription
## featureData
## featureNames: 206797_at 203440_at ... 201130_s_at (6221 total)
## fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 17420468
## Annotation: hgu133a
##
## $filter.log
## $filter.log$numDupsRemoved
## [1] 7408
##
## $filter.log$numLowVar
## [1] 6221
##
## $filter.log$numRemoved.ENTREZID
## [1] 2423
##
## $filter.log$feature.exclude
## [1] 10
vdx.filt <- vdx.filt[[1]]
Format the vdx data for pamr, and train a classifier to predict ER status. For extra clarity in the results, it might be useful to rename the binary er status used in the data package to something more descriptive.
library(pamr)
## Loading required package: survival
dat <- exprs(vdx.filt)
gN <- as.character(fData(vdx.filt)$Gene.symbol)
gI <- featureNames (vdx.filt)
sI <- sampleNames (vdx.filt)
erStatus <- pData (vdx)$er
erStatus <- gsub (0 , "ER -" , erStatus )
erStatus <- gsub (1 , "ER +" , erStatus )
Fitting the model
train.dat <- list ( x = dat , y = erStatus , genenames = gN ,
geneid = gI , sampleid = sI )
model <- pamr.train(train.dat ,n.threshold = 100)
## 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model
## Call:
## pamr.train(data = train.dat, n.threshold = 100)
## threshold nonzero errors
## 1 0.000 6221 41
## 2 0.130 5803 41
## 3 0.260 5365 39
## 4 0.390 4926 39
## 5 0.519 4501 39
## 6 0.649 4157 38
## 7 0.779 3789 38
## 8 0.909 3440 38
## 9 1.039 3144 38
## 10 1.169 2884 36
## 11 1.299 2624 34
## 12 1.428 2377 34
## 13 1.558 2197 34
## 14 1.688 2000 33
## 15 1.818 1823 34
## 16 1.948 1663 33
## 17 2.078 1538 33
## 18 2.208 1405 34
## 19 2.337 1281 33
## 20 2.467 1164 33
## 21 2.597 1067 32
## 22 2.727 994 32
## 23 2.857 910 33
## 24 2.987 823 33
## 25 3.117 753 33
## 26 3.246 697 32
## 27 3.376 643 31
## 28 3.506 577 32
## 29 3.636 522 33
## 30 3.766 460 33
## 31 3.896 398 33
## 32 4.026 360 33
## 33 4.155 319 33
## 34 4.285 284 33
## 35 4.415 246 33
## 36 4.545 219 33
## 37 4.675 191 33
## 38 4.805 170 34
## 39 4.935 155 34
## 40 5.064 132 34
## 41 5.194 117 35
## 42 5.324 103 35
## 43 5.454 89 36
## 44 5.584 77 36
## 45 5.714 68 36
## 46 5.844 63 36
## 47 5.973 57 36
## 48 6.103 54 37
## 49 6.233 49 37
## 50 6.363 44 37
## 51 6.493 41 37
## 52 6.623 37 37
## 53 6.753 34 36
## 54 6.882 32 37
## 55 7.012 28 37
## 56 7.142 27 37
## 57 7.272 25 38
## 58 7.402 23 39
## 59 7.532 19 41
## 60 7.661 17 42
## 61 7.791 17 42
## 62 7.921 17 43
## 63 8.051 17 43
## 64 8.181 16 43
## 65 8.311 15 44
## 66 8.441 13 43
## 67 8.570 11 48
## 68 8.700 8 48
## 69 8.830 7 48
## 70 8.960 6 51
## 71 9.090 6 54
## 72 9.220 6 56
## 73 9.350 5 57
## 74 9.479 4 58
## 75 9.609 4 59
## 76 9.739 4 63
## 77 9.869 3 66
## 78 9.999 3 69
## 79 10.129 3 73
## 80 10.259 3 79
## 81 10.388 3 100
## 82 10.518 3 118
## 83 10.648 2 134
## 84 10.778 2 135
## 85 10.908 2 135
## 86 11.038 1 135
## 87 11.168 1 135
## 88 11.297 1 135
## 89 11.427 1 135
## 90 11.557 1 135
## 91 11.687 1 135
## 92 11.817 1 135
## 93 11.947 1 135
## 94 12.077 1 135
## 95 12.206 1 135
## 96 12.336 1 135
## 97 12.466 1 135
## 98 12.596 1 135
## 99 12.726 1 135
## 100 12.856 0 135
We can perform cross-validation using the pamr.cv function. Printing the output of this function shows a table of how many genes were used at each threshold, and the number of classification errors. Both these values need to be taken into account when choosing a suit- able theshold. The pamr.plotcv function can assist with this by producing a diagnostic plot which shows how the error changes with the number of genes. In the plot produced by this function there are two panels; the top one shows the errors in the whole dataset and the bot- tom one considers each class separately. In each panel, the x axis corresponds to the thresh- old (and number of genes at each threshold) whereas the y-axis is the number of misclassifi- cations.
model.cv <- pamr.cv(model , train.dat , nfold = 10)
## 12Fold 1 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 2 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 3 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 4 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 5 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 6 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 7 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 8 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 9 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
## Fold 10 :123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100
model.cv
## Call:
## pamr.cv(fit = model, data = train.dat, nfold = 10)
## threshold nonzero errors
## 1 0.000 6221 43
## 2 0.130 5803 43
## 3 0.260 5365 43
## 4 0.390 4926 43
## 5 0.519 4501 42
## 6 0.649 4157 42
## 7 0.779 3789 40
## 8 0.909 3440 39
## 9 1.039 3144 39
## 10 1.169 2884 39
## 11 1.299 2624 40
## 12 1.428 2377 40
## 13 1.558 2197 39
## 14 1.688 2000 37
## 15 1.818 1823 38
## 16 1.948 1663 37
## 17 2.078 1538 38
## 18 2.208 1405 38
## 19 2.337 1281 38
## 20 2.467 1164 38
## 21 2.597 1067 38
## 22 2.727 994 37
## 23 2.857 910 36
## 24 2.987 823 36
## 25 3.117 753 36
## 26 3.246 697 36
## 27 3.376 643 36
## 28 3.506 577 35
## 29 3.636 522 34
## 30 3.766 460 35
## 31 3.896 398 34
## 32 4.026 360 34
## 33 4.155 319 34
## 34 4.285 284 34
## 35 4.415 246 34
## 36 4.545 219 34
## 37 4.675 191 34
## 38 4.805 170 34
## 39 4.935 155 35
## 40 5.064 132 35
## 41 5.194 117 36
## 42 5.324 103 36
## 43 5.454 89 36
## 44 5.584 77 36
## 45 5.714 68 37
## 46 5.844 63 37
## 47 5.973 57 38
## 48 6.103 54 38
## 49 6.233 49 38
## 50 6.363 44 38
## 51 6.493 41 39
## 52 6.623 37 39
## 53 6.753 34 40
## 54 6.882 32 38
## 55 7.012 28 38
## 56 7.142 27 38
## 57 7.272 25 39
## 58 7.402 23 40
## 59 7.532 19 41
## 60 7.661 17 42
## 61 7.791 17 42
## 62 7.921 17 42
## 63 8.051 17 43
## 64 8.181 16 44
## 65 8.311 15 44
## 66 8.441 13 45
## 67 8.570 11 47
## 68 8.700 8 49
## 69 8.830 7 49
## 70 8.960 6 52
## 71 9.090 6 55
## 72 9.220 6 58
## 73 9.350 5 58
## 74 9.479 4 61
## 75 9.609 4 61
## 76 9.739 4 64
## 77 9.869 3 66
## 78 9.999 3 69
## 79 10.129 3 75
## 80 10.259 3 88
## 81 10.388 3 109
## 82 10.518 3 117
## 83 10.648 2 126
## 84 10.778 2 134
## 85 10.908 2 135
## 86 11.038 1 135
## 87 11.168 1 135
## 88 11.297 1 135
## 89 11.427 1 135
## 90 11.557 1 135
## 91 11.687 1 135
## 92 11.817 1 135
## 93 11.947 1 135
## 94 12.077 1 135
## 95 12.206 1 135
## 96 12.336 1 135
## 97 12.466 1 135
## 98 12.596 1 135
## 99 12.726 1 135
## 100 12.856 0 135
pamr.plotcv(model.cv)
In the following sections, feel free to experiment with different values of the threshold (which we will call Delta) The misclassifications can easily be visualised as a ’confusion table’. This simply tabulates the classes assigned to each sample against the original label assigned to the sample. e.g. Misclassifications are samples that we thought were ’ER+’ but have been assigned to the ’ER-’ group by the classifier, or ’ER-’ samples assigned as ’ER+’ by the classifier.
Delta <- 8
pamr.confusion(model.cv , Delta)
## ER - ER + Class Error rate
## ER - 106 29 0.21481481
## ER + 14 195 0.06698565
## Overall error rate= 0.125
A visual representation of the class separation can be obtained using the pamr.plotcvprob function. For each sample there are two circles representing the probabilty of that sample being classified ER- (red) or ER+ (green).
pamr.plotcvprob(model , train.dat , Delta )
There are a couple of ways of extract the details of the genes that have been used in the classifier. We can list their names using the pamr.listgenes function, which in our case these are just returns the microarray probe names. We can however, use these IDs to query the featureData stored with the original vdx object. We can also plot the expression values for each gene, coloured according to the class label.
pamr.listgenes(model , train.dat , Delta )
## id ER --score ER +-score
## [1,] 205225_at -0.3257 0.2104
## [2,] 209173_at -0.202 0.1305
## [3,] 209603_at -0.1753 0.1133
## [4,] 204508_s_at -0.1184 0.0765
## [5,] 205009_at -0.0987 0.0638
## [6,] 214440_at -0.083 0.0536
## [7,] 205186_at -0.0583 0.0376
## [8,] 219197_s_at -0.0507 0.0327
## [9,] 209459_s_at -0.0453 0.0292
## [10,] 212956_at -0.0425 0.0274
## [11,] 218976_at -0.0402 0.026
## [12,] 203929_s_at -0.035 0.0226
## [13,] 204623_at -0.0325 0.021
## [14,] 215729_s_at 0.0295 -0.0191
## [15,] 209800_at 0.0211 -0.0136
## [16,] 218211_s_at -0.018 0.0116
## [17,] 205862_at -0.0109 0.0071
classifierGenes <- pamr.listgenes(model , train.dat , Delta )[,1]
## id ER --score ER +-score
## [1,] 205225_at -0.3257 0.2104
## [2,] 209173_at -0.202 0.1305
## [3,] 209603_at -0.1753 0.1133
## [4,] 204508_s_at -0.1184 0.0765
## [5,] 205009_at -0.0987 0.0638
## [6,] 214440_at -0.083 0.0536
## [7,] 205186_at -0.0583 0.0376
## [8,] 219197_s_at -0.0507 0.0327
## [9,] 209459_s_at -0.0453 0.0292
## [10,] 212956_at -0.0425 0.0274
## [11,] 218976_at -0.0402 0.026
## [12,] 203929_s_at -0.035 0.0226
## [13,] 204623_at -0.0325 0.021
## [14,] 215729_s_at 0.0295 -0.0191
## [15,] 209800_at 0.0211 -0.0136
## [16,] 218211_s_at -0.018 0.0116
## [17,] 205862_at -0.0109 0.0071
pamr.geneplot(model , train.dat ,Delta)
You may get an error message Error in plot.new(): Figure margins too large when trying to produce the gene plot. If this occurs, try increasing the size of your plotting window, or decrease the number of genes by decreasing the threshold. Alternatively, the fol- lowing code will write the plots to a pdf.
pdf ("classifierProfiles.pdf")
for (i in 1: length (classifierGenes)) {
Symbol <- fData(vdx.filt)[classifierGenes[i] , "Gene.symbol"]
boxplot(exprs(vdx.filt)[classifierGenes[i], ] ~ erStatus ,
main = Symbol )
}
dev.off()
## png
## 2
Use the genes identified by the classifier to produce a heatmap to confirm that they separate the samples as expected.
symbols <- fData(vdx.filt)[classifierGenes , "Gene.symbol"]
heatmap(exprs(vdx.filt)[classifierGenes, ] , labRow = symbols )
We can now test the classifier on an external dataset. We choose the transbig dataset for simplicity as it was generated on the same microarray platform
library (breastCancerTRANSBIG)
data (transbig)
pData (transbig)[1:4, ]
## samplename dataset series id filename size age er grade pgr her2 brca.mutation e.dmfs t.dmfs node t.rfs e.rfs treatment tissue t.os e.os
## VDXGUYU_4002 VDXGUYU_4002 TRANSBIG VDXGUYU 4002 GSM177885.CEL.gz 3.0 57 0 3 NA NA NA 1 723 0 723 1 0 1 937 1
## VDXGUYU_4008 VDXGUYU_4008 TRANSBIG VDXGUYU 4008 GSM177886.CEL.gz 3.0 57 1 3 NA NA NA 0 6591 0 183 1 0 1 6591 0
## VDXGUYU_4011 VDXGUYU_4011 TRANSBIG VDXGUYU 4011 GSM177887.CEL.gz 2.5 48 0 3 NA NA NA 1 524 0 524 1 0 1 922 1
## VDXGUYU_4014 VDXGUYU_4014 TRANSBIG VDXGUYU 4014 GSM177888.CEL.gz 1.8 42 1 3 NA NA NA 1 6255 0 2192 1 0 1 6255 1
transbig.filt <- transbig [featureNames(vdx.filt) , ]
predClass <- pamr.predict(model ,exprs(transbig.filt) ,Delta )
table (predClass, pData(transbig)$ er)
##
## predClass 0 1
## ER - 52 11
## ER + 12 123
boxplot (pamr.predict(model , exprs(transbig.filt), Delta ,
type = "posterior" )[, 1] ~ pData(transbig)$er)
Make a heatmap of the transbig data using the genes involved in the vxd classifier
erLab <- as.factor(pData(transbig)$er)
levels (erLab) <- c ("blue" , "yellow")
heatmap (exprs(transbig.filt)[classifierGenes , ] , labRow = symbols ,
ColSideColors = as.character (erLab))
An attractive feature of the vdx dataset is that it includes survival data for each breast can- cer patient. We are not explicitly covering survival analysis in this course, but for your ref- erence, here are the commands to create survival curves when patients are grouped by ER status and tumour grade.
library (survival)
par (mfrow = c (1 , 2))
plot (survfit (Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
pData(vdx)$er) , col = c("cyan" , "salmon"))
plot (survfit(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
pData (vdx)$grade) , col = c("blue" , "yellow" , "orange"))
survdiff(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
pData (vdx)$er)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
## pData(vdx)$er)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$er=0 135 38 45.5 1.244 2.04
## pData(vdx)$er=1 209 80 72.5 0.781 2.04
##
## Chisq= 2 on 1 degrees of freedom, p= 0.154
survdiff(Surv(pData(vdx)$t.dmfs , pData(vdx)$e.dmfs) ~
pData(vdx)$grade)
## Call:
## survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
## pData(vdx)$grade)
##
## n=197, 147 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## pData(vdx)$grade=1 7 0 3.06 3.06 3.21
## pData(vdx)$grade=2 42 10 16.69 2.68 3.53
## pData(vdx)$grade=3 148 61 51.26 1.85 6.72
##
## Chisq= 7.7 on 2 degrees of freedom, p= 0.0218